Abstract


Coronaviruses (CoVs) are single-stranded RNA viruses that contain the
largest RNA genomes, and severe acute respiratory syndrome coronavirus
(SARS-CoV), a newly found group 2 CoV, emerged as an infectious disease
with a high mortality rate. In this study, we compared the synonymous
codon usage patterns of the nucleocapsid and spike genes of CoVs with
those of the C-type lectin domain (CTLD) genes of human and mouse on a
codon basis. Our findings indicate that the nucleocapsid genes of CoVs
were more strongly affected by synonymous codon usage bias than the
spike genes, and that the CTLDs of human and mouse partially overlapped
with the nucleocapsid genes of CoVs. In addition, we observed that the
CTLDs showing relative synonymous codon usage (RSCU) patterns similar to
those of CoVs were commonly derived from human chromosome 12 and mouse
chromosomes 6 and 12, suggesting that there might be specific genomic
regions or chromosomes whose synonymous codon usage pattern more closely
resembles that of viral genes. Our findings may contribute to the
development of codon-optimization methods for DNA vaccines, and further
study is needed to determine whether a specific correlation exists
between codon usage patterns and chromosomal locations in higher
organisms.
Introduction


Coronaviruses (CoVs), members of the family Coronaviridae, are
enveloped viruses and contain the largest RNA genomes, with some
reaching almost 30,000 nucleotides (Dimmock et al., 2002). They
primarily infect the upper respiratory and gastrointestinal tracts of
animals, and severe acute respiratory syndrome coronavirus (SARS-CoV),
a newly emerged group 2 CoV, spread rapidly from Asia to North America
and Europe with a high degree of transmissibility and mortality (Lew et
al., 2003; Riley et al., 2003; Friman et al., 2008). In response to the
SARS pandemic in 2003, many scientists have become interested in
vaccine development against SARS-CoV. A phase I human study of a SARS
DNA vaccine was reported by Martin and his colleagues, showing
immunogenicity with spike proteins of SARS-CoV in all subjects and
neutralizing antibody responses in 8 of 10 subjects (Yang et al., 2005;
Martin et al., 2008).
DNA vaccination represents a new strategy against highly pathogenic
and infectious diseases (Ramakrishna et al., 2004; Martin et al., 2006,
2007; Wang et al., 2006a, 2006b; Catanzaro et al., 2007), and a DNA
vaccine is usually produced in three successive steps. First, primers
specific to the target regions of the viral genome are produced to
generate cDNA fragments. Second, these cDNA fragments are inserted into
a bacterial DNA vaccine plasmid such as an Escherichia coli plasmid.
Finally, the prepared DNA vaccine is injected into the cells of target
organisms such as mouse, rabbit or human subjects to produce one or
more specific proteins by mimicking viral replication and protein
production in the host. Because these proteins are recognised as
foreign antigens in the target organisms, they trigger immune responses
(Sin and Weiner, 2000; Donnelly et al., 2003). The phase I
clinical trials of DNA vaccines against West Nile virus, Ebola virus
and human immunodeficiency virus type 1 (HIV-1) in healthy adults have
already been performed (Martin et al., 2006, 2007; Catanzaro et al.,
2007). According to Wang et al. (2006a, 2006b) and Ramakrishna et al.
(2004), codon optimization of the Tat and envelope genes of HIV-1 as
well as the hemagglutinin genes of influenza A virus resulted in better
antigen expression and immunogenicity in model animals such as mouse
and rabbit. In those studies, each target gene of HIV-1 and influenza A
virus was changed to the preferred codons of the overall mammalian
system to promote better expression of the encoded protein.

Synonymous codons encode the same amino acid in protein synthesis, but
they are not used randomly; some codons are used more frequently than
others (Moriyama and Hartl, 1993; McInerney, 1998; Duret, 2002; Lynn et
al., 2002; Kawabe and Miyashita, 2003; Singer and Hickey, 2003). Early
studies using Bacillus subtilis and Caenorhabditis elegans genes showed
that codon usage bias mirrors tRNA abundance (Shields and Sharp, 1987;
Stenico et al., 1994). In prokaryotes, such as thermophilic bacteria,
Figure 1. Principal component analysis of the % GC contents on the 1st, 2nd and 3rd codon positions. The first two factors from the principal component
analysis (PRIN1 and PRIN2) are presented with each eigenvalue proportion. Nucleocapsid (A) and spike (B) coding genes of the Coronavirus genus were
compared with the CTLD genes of human (Homo sapiens) and mouse (Mus musculus) species (C, D). Family names of CTLDs are also presented with
each plot. G1, Group 1 CoV; G2, Group 2 CoV; G3, Group 3 CoV; N, nucleocapsid gene of CoV; S, spike gene of CoV; homo, Homo sapiens; mus, Mus
musculus.
highly expressed genes shift their codon usage toward a more
restricted set of preferred synonymous codons compared with less highly
expressed genes within the genome (Lynn et al., 2002; Singer and
Hickey, 2003). As for viral genomes, Gu et al. (2004) reported that the
relative synonymous codon usage (RSCU) values of the order Nidovirales,
which includes SARS-CoV, are virus-specific, and that translational
selection and gene length may not affect the codon usage pattern in
some viruses. However, Jenkins and Holmes (2003), who analyzed the
extent of codon usage bias in the complete genomic coding regions of 50
genetically and ecologically diverse human RNA viruses using the
effective number of codons (ENC) as a parameter, showed that the
overall extent of codon usage bias was low and that there was little
variation in bias between genes. More recently, Shackelton et al.
(2006) reported a striking difference in CpG content between DNA
viruses with large and small genomes: the majority of large-genome
viruses showed the expected frequency of CpG, whereas most small-genome
viruses had CpG contents far below the expected values. They suggested
that these differences might be mainly due to differences in viral
replication and repair mechanisms, such as the use of cellular or viral
replicative machinery. In our previous studies, synonymous codon usage
patterns among RNA viruses such as influenza A viruses and HIV-1 were
separated by region, subtype, host or year of occurrence, leading to
the expectation that there might be some


Table 1. Eigenvectors and eigenvalues of the principal component
analysis using the % GC contents on each codon position.

                                            Eigenvectors
Species                   Variances         PRIN1        PRIN2
Coronavirus               GC1st              0.700134     0.188262
(Nucleocapsid)            GC2nd              0.710481    -0.087886
                          GC3rd             -0.070915     0.978179
                          Eigenvalue %      57.6         34.2
Coronavirus               GC1st              0.628461    -0.383325
(Spike)                   GC2nd              0.419316     0.899838
                          GC3rd              0.655142    -0.208216
                          Eigenvalue %      60.8         27.8
Human + Mouse             GC1st              0.629533    -0.110096
(C-type lectin            GC2nd              0.525960     0.789002
domain genes)             GC3rd              0.571887    -0.604445
                          Eigenvalue %      74.2         19.6
Coronavirus +             GC1st              0.645441    -0.254191
Human + Mouse             GC2nd              0.630553    -0.354859
(Nucleocapsid,            GC3rd              0.431057     0.899701
Spike + C-type lectin     Eigenvalue %      72.8         24.5
domain genes)


correlations between the nucleotide patterns and the direction of
viral variation on a codon basis (Ahn and Son, 2006, 2007; Ahn et al.,
2006). Furthermore, van Hemert et al. (2007) reported that the recent
evolution of astroviruses was associated with a switch in nucleotide
composition and codon usage between non-human mammalian and human/avian
astroviruses. They suggested that evolutionary events within a virus
family might be driven by forces operating at the level of synonymous
substitutions, such as nucleotide composition, translational selection,
and codon usage.

In this study, we hypothesized that the codon usage bias of viral
genes might tend to mimic that of specific genes in their host species
that perform a key role during the initial immune responses. C-type
lectins, a superfamily of proteins containing C-type lectin domains
(CTLDs), are a large group of extracellular metazoan proteins with
diverse functions (Zelensky and Gready, 2005). They usually provide
Ca2+-dependent sugar-recognition activity and initiate various kinds of
biological processes, such as adhesion, endocytosis, and pathogen
neutralization (Drickamer and Dodd, 1999; Dodd and Drickamer, 2001).
With respect to immune responses, C-type lectins are also known to
perform an important function in dendritic cell (DC) immune regulation,
which includes the triggering of inflammatory cytokines as well as the
delivery of antigens to T cells to initiate the specific immune
response (Cella et al., 1997, 1999). C-type lectin receptors in DCs
have been shown to act as capture or attachment factors for influenza A
virus (H5N1 subtype) and HIV-1 (Lambert et al., 2008; Wang et al.,
2008), and SARS-CoV infection is also known to induce immune responses
related to DC functions, such as delayed activation of alpha interferon
(Spiegel et al., 2006). In this study, we compared the synonymous codon
usage patterns of the Coronavirus genus with those of the CTLD genes of
human (Homo sapiens) and mouse (Mus musculus) to investigate possible
relations between microbes and their host species on a codon basis.


Results 


Principal component analysis using the % GC
contents on the 1st, 2nd and 3rd codon positions


The first two principal factors of the % GC contents on each codon
position from the nucleocapsid and spike genes of CoVs as well as the
CTLDs of human and mouse were investigated using principal component
analysis (Figure 1). The eigenvectors of each principal factor (PRIN1
and PRIN2) and the eigenvalue proportions (%) are presented in Table 1.
Among the CoV genes, the first two principal components of the
nucleocapsid genes accounted for 57.6% and 34.2% of the total variance
of the data set, whereas those of the spike genes accounted for 60.8%
and 27.8%, respectively (Figure 1A and 1B). The eigenvector
compositions of the two genes, however, showed different patterns. The
% GC content on the third codon position (GC3rd) among nucleocapsid
genes showed a highly positive correlation (0.978) with the PRIN2 axis,
whereas
Figure 2. Phylogenetic analysis of the CTLD genes of human (Homo sapiens) and mouse (Mus musculus) species (A), and scatter
plots of the correspondence analysis using the relative synonymous codon usage values of the nucleocapsid and spike genes of CoVs as well as the
CTLDs of human and mouse (B). The phylogram was derived by the Neighbor-Joining method with a bootstrap analysis of 1000 iterations, and bootstrap values (%)
below 100% are shown as circled numbers at each node. The chromosome source of each CTLD is also presented in the right column of
the tree. G1, Group 1 CoV; G2, Group 2 CoV; G3, Group 3 CoV; N, nucleocapsid gene of CoV; S, spike gene of CoV; CLEC, C-type lectin domain gene.
PRIN2 of the spike genes was strongly dependent on GC2nd (0.899)
(Table 1). The former pattern also appeared when the two CoV genes and
the CTLDs from human and mouse were analyzed together (Figure 1D, Table
1), whereas the CTLD genes alone revealed eigenvector patterns very
similar to those of the spike genes of CoVs (Figure 1C, Table 1). The
eigenvectors of PRIN1 in all cases commonly showed positive
correlations with GC1st and GC2nd, indicating that PRIN1 was mainly
dependent on the non-synonymous codon usage patterns of each gene.

In Figure 1, we categorized the nucleocapsid and spike genes by CoV
group (groups 1, 2 and 3, presented as G1, G2 and G3, respectively).
The nucleocapsid genes of both human and bat SARS-CoVs were located
close to the G3 CoVs such as infectious bronchitis virus along PRIN1,
but human SARS-CoVs displayed patterns similar to those of the G2 CoVs
such as bovine, murine and rat CoVs along PRIN2 (Figure 1A). On the
other hand, the spike genes of SARS-CoVs were located near the G2
murine hepatitis virus CoVs as well as G3 human CoV 229E (Figure 1B).
Human CoV (HKU1), a G2 CoV, was located apart from the other G2 CoVs in
both nucleocapsid and spike genes. As for the CTLD genes, mouse and
human genes spread broadly across the biplots, and they were not
clearly separated from each other (Figure 1C). Along the PRIN1 axis,
the family 4Ms, 10As, 16A and 14A of human CTLDs as well as 3B, 4F and
4G of mouse CTLDs were positively biased, and those genes also showed
different % GC contents from the CoV genes in Figure 1D.


Phylogenetic relationships among CTLD genes of 
human and mouse species 


To compare the phylogenetic relationships among human and mouse CTLDs
with the synonymous codon usage patterns, we constructed a phylogenetic
tree using the Neighbor-Joining method with 1000 bootstrap replicates.
All the genes were well grouped into their CTLD families, and the human
7As, 1A, 14A, 2B and 16A CTLDs were located separately from the other
human CTLD genes, showing closer relationships with mouse genes (Figure
2A). Among the mouse genes, the family 4A2, 4N, 4D and 4G CTLDs showed
close relationships with the human families 4As, 4Cs and 4Ms. The
family 10As of human genes were located apart from the other families.
With respect to chromosomal origin, the mouse CTLDs were encoded on
chromosomes 6, 8, 9 and 12, and the human genes on chromosomes 12, 14,
16, 17 and 19.


Synonymous codon usage analysis using the CA 
method 


To investigate the synonymous codon usage patterns, we first parsed
each nucleotide sequence into its synonymous codon groups and then
calculated the RSCU values for each sequence. We then assigned each
kind of gene or species as rows and the RSCU values of the 59 codons as
columns for the CA. All the target sequences of CoVs, human and mouse
species were analyzed together to compare the overall synonymous codon
usage patterns (Figure 2B). First, the nucleocapsid and spike genes of
CoVs showed opposite patterns along the first dimensional factor (Dim1)
of the CA plots, and the human and mouse CTLDs were located on the same
side as the nucleocapsid genes. Second, we performed linear regression
analysis to identify which codon usage parameters most strongly affect
Dim1 and Dim2 of the CA result (Table 2). Dim1 showed significant
correlations with all the codon usage parameters, namely GC1st, GC2nd,
GC3rd and ENC. Among the % GC contents, Dim1 was most strongly
dependent on GC1st and GC2nd, with R² values of 0.781 and 0.671,
respectively. As for Dim2, however, only GC3rd showed a positive
correlation (R² = 0.500) among the % GC contents.
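The RSCU step described at the start of this subsection can be sketched in a few lines. This is only an illustrative sketch in Python, not the Java code used in the study; the names `CODON_TABLE`, `SYNONYMS` and `rscu` are hypothetical, the standard genetic code is assumed, and assigning an RSCU of 0.0 to all codons of an amino acid that is absent from a sequence is a choice made here, not something the paper specifies.

```python
from collections import Counter
from itertools import product

# Standard genetic code in UCAG order (assumed here; '*' marks stop codons).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Group the 59 synonymous codons by amino acid (Met, Trp and stops excluded).
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    if aa not in "MW*":
        SYNONYMS.setdefault(aa, []).append(codon)


def rscu(seq):
    """Return {codon: RSCU} for the 59 synonymous codons of one coding sequence.

    Amino acids that never occur get 0.0 for all their codons (illustrative choice).
    """
    seq = seq.upper().replace("T", "U")
    usable = len(seq) - len(seq) % 3
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))
    values = {}
    for aa, codons in SYNONYMS.items():
        total = sum(counts[c] for c in codons)
        for c in codons:
            # observed count divided by the count expected under equal usage
            values[c] = counts[c] * len(codons) / total if total else 0.0
    return values
```

Each sequence then yields one 59-element row of the input table for the CA. In a toy sequence with three GCU codons and one GCC codon, GCU is used three times as often as expected under equal usage of the four alanine codons.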

In Figure 2B, we also present an enlarged view of the region
containing the CTLDs of both human and mouse species, labeled with each
family, member and transcript variant (if any) name (Figure 2B, right).
The CTLD genes located within or near the CA plots of CoVs were
clustered as 'group 1', and they are shown in underlined italic
characters in the phylogenetic tree (Figure 2A). Interestingly, the
group 1 CTLDs of human were derived from chromosome 12, whereas those
of mouse were from chromosomes 6 and 12.

CTLDs were originally divided into seven groups based on their domain
architecture, and seven new groups were added in a revised
classification in 2002 (Drickamer, 1993; Drickamer and Fadden, 2002;
Zelensky and Gready, 2005). Among the CTLD genes, human clec4C_1 and
mouse clec14A showed very close relationships with the nucleocapsid
genes of SARS-CoVs, whereas human clec4A_4, 14A and 10A_2 as well as
mouse clec4A2, 4G and 4F were located far from SARS-CoVs on the basis
of both Dim1 and Dim2.


Comparison of RSCU values between SARS-CoVs 
and the most similar CTLD genes of human and 
mouse 


In the CA result, human clec4C_1 and mouse 
clec14A showed very close relationships with the 


Table 2. Results of the regression analysis between each dimensional
factor of the correspondence analysis using RSCU values of CoVs and
each codon pattern parameter.

                        DIM1 (a)                       DIM2 (b)
Variances    R-Square (c)   Parameter        R-Square    Parameter
                            estimate (d)                 estimate
GC1st        0.7809*        0.02896          0.0016      -0.00105
GC2nd        0.6710*        0.03903          0.0135      -0.00440
GC3rd        0.3497*        0.01701          0.5003*      0.01620
ENC (e)      0.5677*        0.03538          0.1070*      0.01168

(a) First dimensional factor of the correspondence analysis, (b) Second
dimensional factor of the correspondence analysis, (c) R² value of each
linear regression analysis, (d) Parameter estimate resulting from the
linear regression analysis, (e) Effective number of codons. *P < 0.0001.


nucleocapsid genes of SARS-CoV (Figure 2B). We therefore compared the
RSCU profiles of each gene to analyze the different patterns in more
detail (Figure 3). The nucleocapsid and spike genes of human and bat
SARS-CoVs (Figure 3A and 3B) were compared with two types of CTLD
genes: human clec4C_1 and mouse clec14A, which were included in 'group
1' in Figure 2B, and human clec10A_2 and mouse clec4F, which were not
included in group 1 (Figure 3C and 3D). First, the nucleocapsid genes
of human and bat SARS-CoVs did not use either of the two synonymous
codons for cysteine (CYS) or ACG for threonine (THR), and showed
patterns similar to those of the spike genes in the alanine (ALA),
asparagine (ASN), glutamine (GLN), glycine (GLY), proline (PRO), serine
(SER) and threonine encoding codon groups (Figure 3A and 3B). In the
phenylalanine (PHE) codons and the first three leucine (LEU) codons,
however, the nucleocapsid and spike genes showed somewhat opposite
patterns, and the spike genes showed more biased patterns in CCU and
UCU, which encode proline and serine, respectively. Second, the human
and mouse CTLD genes in Figure 3C and 3D shared common RSCU profiles
with SARS-CoVs in the alanine and proline encoding codon groups, but
differed from SARS-CoVs in the glycine and phenylalanine codon groups.
Human clec4C_1 and mouse clec14A showed RSCU profiles similar to those
of the spike genes in the arginine and serine coding groups, and human
clec4C_1 alone showed the same patterns as the nucleocapsid genes in
the isoleucine (ILE) and serine encoding groups. Human clec10A_2 and
mouse clec4F showed more biased patterns in the leucine and serine
encoding codons than the group 1 CLECs (Figure 3D).
Discussion


Codon usage bias has been studied in various organisms ranging from
viruses to eukaryotes, and codon usage in target viral genes has also
been optimized to improve the efficacy of DNA vaccines (Ramakrishna et
al., 2004; Shackelton et al., 2006; van Hemert et al., 2007; Wang et
al., 2006a, 2006b). In our previous studies, synonymous codon usage
among RNA viruses such as influenza A viruses and HIV-1 revealed
specific biases by region, subtype, host or year of occurrence,
suggesting that there might be some correlations between codon usage
patterns and viral variation on a codon basis (Ahn and Son, 2006, 2007;
Ahn et al., 2006).

In this paper, we examined whether the % GC contents on the first
(GC1st), second (GC2nd) and third (GC3rd) codon positions showed
similar patterns among the same genes of viral species as well as among
the CTLDs of human and mouse (Figure 1). Of the two CoV genes, the
nucleocapsid genes showed a highly positive eigenvector for GC3rd
(0.978) along PRIN2, and this pattern was also observed when we
compared all the target genes from CoVs, human and mouse together using
principal component analysis (Table 1). Traditionally, the spike
protein is known to define viral tropism through its receptor
specificity and its membrane fusion activity during virus entry into
cells, so it has been the major target of neutralizing antibodies in
vaccine development (Gallagher and Buchmeier, 2001). Recently, however,
the nucleocapsid has also been studied as a new viral target protein in
the vaccine industry because of its good immunogenicity (Bode et al.,
2003; Ye et al., 2007). In hepatitis C virus (HCV), the nucleocapsid
protein is known to play an important role in immune evasion, including
the inhibition of IFN-α-induced tyrosine phosphorylation and activation
of STAT1 in hepatic cells (Bode et al., 2003). The nucleocapsid protein
of SARS-CoV itself has become a potential candidate for DNA vaccine
production because it plays a critical role in the viral infection
process (Zhu et al., 2004; Zhao et al., 2007; Mark et al., 2008;
Schulze et al., 2008). Ye et al. (2007) reported that the nucleocapsid
gene of mouse hepatitis virus A59, a group 2 CoV, circumvented the
effects of type I interferon. Furthermore, Okada and his colleagues
reported that mice vaccinated with the nucleocapsid protein of SARS-CoV
showed T-cell immune responses, and Gao et al. reported that a SARS DNA
vaccine encoding the nucleocapsid protein generated IFN-γ-producing T
cells in rhesus monkeys (Gao et al., 2003; Okada et al., 2005).
Figure 3. Profiles of the relative synonymous codon usage values shown as vertical bar graphs. The nucleocapsid (A) and spike (B) genes of
human and bat SARS-CoVs, as well as the human and mouse CTLD genes that were located near the nucleocapsid genes of SARS-CoVs (C) or
far from them (D) in the correspondence analysis in Figure 2B, are presented. GenBank accession numbers are presented in the
legends. G1, Group 1 CoV; G2, Group 2 CoV; G3, Group 3 CoV; N, nucleocapsid gene of CoV; S, spike gene of CoV; CLEC, C-type lectin domain gene.


Our findings demonstrated that GC3rd of the nucleocapsid genes was
highly positively related to PRIN2 among CoV species, whereas the spike
genes did not show any specific pattern related to GC3rd. This result
implies that the nucleocapsid genes of CoVs might be more heavily
affected than the spike genes by synonymous codon usage bias, which is
usually determined by the nucleotide at the third codon position
(Figure 1A and 1B).

To compare the synonymous codon usage patterns of the two CoV genes
and the CTLDs of human and mouse in more detail, we calculated the RSCU
values of all the target genes from CoVs, human and mouse, and then
analyzed the Euclidean distances using the CA (Figure 2B). As a result,
the nucleocapsid genes of SARS-CoVs from both human and bat showed the
most biased patterns (0.292) among CoVs along Dim2, which was
significantly correlated with the GC3rd (R² = 0.50, P < 0.0001) of
those genes in the linear regression test (Table 2), and the CTLDs of
both human and mouse were broadly distributed in the first quadrant
(Figure 2B). Interestingly, the group 1 CTLDs of human were derived
from chromosome 12 and those of mouse from chromosomes 6 and 12,
whereas the other CTLDs were from chromosomes 14, 16, 17 or 19 for
human and 6, 8 or 9 for mouse. This finding offers a clue that there
might be specific genomic regions or chromosomes whose synonymous codon
usage pattern more closely resembles that of antigenic viral genes.
Recently, DNA vaccines have become an increasingly important part of
vaccine development against many infectious viruses (Martin et al.,
2006, 2007; Catanzaro et al., 2007), and the codon-optimization method,
which switches the synonymous codons of viruses to those of their host
organisms, has been reported to improve the immunogenicity of HIV-1 and
influenza A virus (Ramakrishna et al., 2004; Wang et al., 2006a,
2006b). At present, the preferred codons of the overall mammalian
system are used in the codon-optimization process, but we observed
various synonymous codon biases even among the CTLD genes of human and
mouse. Although those differences might be due to the chromosomal
region from which each gene is transcribed, or to other factors, it is
clear that the preferred codons of host organisms are more varied than
previously assumed. In the case of the CTLDs of human and mouse hosts,
the group 1 genes were commonly transcribed from chromosome 6 or 12.

On the other hand, human CoV (HKU1), which belongs to the group 2
CoVs, showed the most distinct synonymous codon usage bias in both % GC
contents and RSCU patterns (Figures 1 and 2), which agrees with the
results of Woo et al. (2005). Woo et al. suggested that this might be
because human CoV (HKU1) originated from a major recombination event
and numerous minor recombination events among group 2 CoVs. In this
study, the nucleocapsid genes of human CoV (HKU1) were found on the
opposite side from the other group 2 CoVs along PRIN2 (Figure 1A),
which was strongly related to GC3rd in Table 1, and they also revealed
RSCU patterns opposite to those of the other group 2 CoVs on the basis
of Dim1 in Figure 2B.
In Figure 3, we compared the RSCU profiles of both the nucleocapsid
and spike genes of human SARS-CoVs (AY290752, AY627048) with those of
other genes: the nucleocapsid genes of bat SARS-CoVs (DQ071614,
DQ0412043), the group 1 CLECs of human (NM_130441) and mouse
(NM_025809), which were located closest to human SARS-CoV, and other
CLECs of human (NM_006344) and mouse (NM_016751), which showed patterns
distinct from CoVs. As a result, the nucleocapsid genes of both human
and bat SARS-CoVs did not use either of the two synonymous codons for
cysteine or ACG for threonine at all (Figure 3A), whereas other CoVs
used them (data not shown). Among SARS-CoVs, the RSCU profiles differed
somewhat between the nucleocapsid and spike genes, especially in the
phenylalanine and leucine encoding codons, and the spike genes showed
more biased patterns in U-ended codons such as CCU (RSCU = 2.34) and
UCU (RSCU = 2.67) for proline and serine, respectively. In general, the
RSCU value is 1.00 if there is no codon usage bias. Human clec4C_1 and
mouse clec14A showed profiles very similar to those of the spike genes,
especially that of bat SARS-CoV, in the arginine coding group, with
high RSCU values over 2.50 for AGA. Human clec10A_2 and mouse clec4F
showed more biased patterns in GC-ending codons such as CUC, CUG and
UCC for leucine and serine than the group 1 CLECs (Figure 3D).

In conclusion, our study suggests that the nucleocapsid genes of CoVs
might be more heavily affected by synonymous codon usage bias than the
spike genes, and that the CTLDs of human and mouse partially overlapped
with the nucleocapsid genes of CoVs. Furthermore, we showed that the
group 1 CTLDs of human were commonly derived from chromosome 12 and
those of mouse from chromosomes 6 and 12. This suggests that there
might be specific genomic regions or chromosomes whose synonymous codon
usage pattern more closely resembles that of viral genes. We also found
similar results between CoV genes and other human or mouse genes in a
preliminary analysis (data not shown). Our findings might be helpful
for developing codon-optimization methods for DNA vaccines, and further
study is necessary to determine whether there is a specific correlation
between the codon usage patterns of coding sequences and the
chromosomal locations from which they are transcribed in higher
organisms.
Methods


Nucleotide sequences 


The nucleocapsid (251 sequences) and spike (284 sequences) genes of
the Coronavirus genus, including SARS-CoVs, were collected from the
NCBI Taxonomy Browser (www.ncbi.nlm.nih.gov/Taxonomy/) in GenBank
format. All the GenBank flat files were then parsed into categories
such as accession number, species name, gene name and sequence length
using Java code to construct a local database for the subsequent
computational work. For the human and mouse species, we collected the
coding sequences of human (Homo sapiens) and mouse (Mus musculus) from
the genome section of NCBI's FTP site (ftp.ncbi.nih.gov/genomes/), and
likewise parsed them and constructed a local database. The gene names,
abbreviations, sequence lengths and GenBank accession numbers of the
CTLD genes used in this study are shown in Supplemental Data Table S1.
Abnormal sequences, which contained characters other than A, G, C or U
or whose length was not divisible by three (the length of a codon
unit), were removed. The MySQL database management system was used to
construct all the local databases on a Linux operating system.
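The screening rule for abnormal sequences can be illustrated with a short sketch. It is written in Python rather than the Java used in the study, and the function names, the dictionary layout of the records and the conversion of DNA-style T to U are assumptions made for illustration only.

```python
VALID_BASES = set("ACGU")


def normalize(seq):
    """Upper-case the sequence and write thymine as uracil (assumed convention)."""
    return seq.strip().upper().replace("T", "U")


def is_usable_cds(seq):
    """True if the sequence is non-empty, contains only A/C/G/U and splits into whole codons."""
    seq = normalize(seq)
    return len(seq) > 0 and len(seq) % 3 == 0 and set(seq) <= VALID_BASES


def filter_records(records):
    """Keep usable sequences from a hypothetical {accession: sequence} mapping."""
    return {acc: normalize(seq) for acc, seq in records.items() if is_usable_cds(seq)}
```

A record containing an ambiguity code such as N, or one whose length leaves a partial codon, is dropped, while the remaining records are returned in normalized form.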


Principal component analysis of % guanine-cytosine 
contents data 


Principal component analysis was performed using the % guanine-cytosine
(GC) contents of the first (GC1st), second (GC2nd) and third (GC3rd)
positions of each codon, which were calculated for the nucleocapsid and
spike coding genes of the Coronavirus genus as well as the CTLD genes
of human and mouse species. All the screened target sequences were
first extracted from our local database, and each coding sequence was
then parsed into codon units. From the pool of codon units for each
sequence, we calculated the % GC contents on the first, second and
third codon positions. Java was used in all calculation processes, and
the SAS 9.1 statistical program (Cary, 2004) was used for the principal
component analysis.
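The two computations in this subsection can be sketched as follows. This is an illustrative Python and NumPy sketch, not the Java and SAS workflow used in the study; the function names are hypothetical, and running the PCA on the correlation matrix of the three GC variables is an assumption, since the text does not state whether the correlation or covariance matrix was used.

```python
import numpy as np


def gc_by_position(seq):
    """Return (%GC1st, %GC2nd, %GC3rd) for a coding sequence made of whole codons."""
    seq = seq.upper()
    n_codons = len(seq) // 3
    result = []
    for pos in range(3):
        bases = seq[pos::3][:n_codons]          # every base at this codon position
        gc = sum(b in "GC" for b in bases)
        result.append(100.0 * gc / n_codons)
    return tuple(result)


def pca(matrix):
    """PCA on the correlation matrix of the columns (assumed setting).

    Returns (proportion of variance per component, eigenvectors as columns),
    sorted by decreasing eigenvalue.
    """
    x = np.asarray(matrix, dtype=float)
    corr = np.corrcoef(x, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)     # symmetric matrix, ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvals / eigvals.sum(), eigvecs
```

With one row of (GC1st, GC2nd, GC3rd) per gene, the first two entries of the returned proportions correspond to the "Eigenvalue %" rows of Table 1 and the first two eigenvector columns to PRIN1 and PRIN2. The sign of each eigenvector is arbitrary, so signs may differ between software packages.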


Phylogenetic analysis 


Twenty-nine sequences of the CTLD genes from human and mouse species
were used for multiple sequence alignment with the ClustalW ver. 1.83
program (Thompson et al., 1997) using default parameters, which set the
DNA weight matrix to the IUB matrix and the gap opening and gap
extension penalties to 15.0 and 6.66, respectively. The
Neighbor-Joining method with 1000 bootstrap replicates was performed
using the PAUP* ver. 4.0b program (Swofford, 1999).


Correspondence analysis 


The correspondence analysis (CA) method was used to compare the RSCU
values for the 59 synonymous codons using the SAS 9.1 statistical
program (Cary, 2004). The RSCU value is the number of times that a
particular codon is observed relative to the number of times that the
codon would be observed in the absence of any codon usage bias. If
there is no codon usage bias, the RSCU value is 1.00. The RSCU was
calculated as

$$\mathrm{RSCU}_{ij} = \frac{X_{ij}}{\frac{1}{n_i}\sum_{j=1}^{n_i} X_{ij}}$$

where $X_{ij}$ is the frequency of occurrence of the jth codon for the
ith amino acid, and $n_i$ is the number of synonymous codons for the
ith amino acid. Each gene is represented as a 59-dimensional vector,
excluding the start and stop codons and UGG, which codes for tryptophan
and has no synonyms (Sharp and Li, 1986). We assigned each kind of gene
or species as rows, and the RSCU values of the 59 codons as columns, in
the input data set for CA. The biplot from a CA includes the best
two-dimensional representation of the data, along with the coordinates
of the plotted points and a measure of the amount of information
retained in each dimension. CA uses the chi-square statistic to
standardize the frequency values, so the distance between two
coordinates of the same type (two rows or two columns) indicates the
chi-square distance (Hair et al., 1998). If this distance is large
enough to be statistically meaningful, the corresponding points in the
output plot will be located far from the origin, and they usually lie
on opposite sides of a coordinate axis. The distance between rows or
between columns reflects the Euclidean distance (Gu et al., 2004;
Perrière and Thioulouse, 2002), but distances between row and column
coordinates have no meaning.
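The relation between plotted distances and chi-square distances can be made concrete with a minimal sketch of simple CA computed by singular value decomposition. It is written in Python with NumPy as an illustration and is not the SAS procedure used in the study; the function name and the return layout are hypothetical, and the sketch assumes a non-negative table in which every row and every column has a positive sum (a codon that is never used in any gene would have to be dropped first).

```python
import numpy as np


def correspondence_analysis(table, n_dims=2):
    """Simple CA of a non-negative table (rows = genes, columns = codons).

    Returns (row coordinates, column coordinates, proportion of inertia),
    each restricted to the first n_dims dimensions.
    """
    n = np.asarray(table, dtype=float)
    p = n / n.sum()                       # correspondence matrix
    r = p.sum(axis=1)                     # row masses
    c = p.sum(axis=0)                     # column masses
    expected = np.outer(r, c)
    s = (p - expected) / np.sqrt(expected)  # chi-square standardized residuals
    u, sv, vt = np.linalg.svd(s, full_matrices=False)
    row_coords = (u * sv) / np.sqrt(r)[:, None]      # principal row coordinates
    col_coords = (vt.T * sv) / np.sqrt(c)[:, None]   # principal column coordinates
    inertia = sv ** 2
    return row_coords[:, :n_dims], col_coords[:, :n_dims], (inertia / inertia.sum())[:n_dims]
```

In principal coordinates, the Euclidean distance between two row points equals the chi-square distance between the corresponding row profiles when all dimensions are kept, and a two-dimensional plot approximates it. Row-to-column distances are not defined in this scaling, in line with the remark above.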


Other statistical analysis 


Linear regression analysis was conducted to determine the correlations
between the first two dimensional factors (Dim1 and Dim2) of the CA
results and the % GC contents on each codon position, the effective
number of codons (ENC) and the average hydrophilicities of the encoded
proteins. ENC values are often used to measure the magnitude of codon
bias; they range from 20, when only one codon is used for each amino
acid, to 61, when all synonymous codons are used with equal frequency
(Wright, 1990). We calculated the ENC value of each nucleotide sequence
using Java code, and all these analyses were performed using the SAS
9.1 statistical program (Cary, 2004).
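As a rough illustration of the two quantities used here, the sketch below computes an ENC value following the commonly stated form of Wright's (1990) formula and the slope and R² of a simple least-squares line. It is written in Python, not the Java and SAS code used in the study. The function names are hypothetical, the standard genetic code is assumed, and the handling of sparse data (skipping amino acids seen fewer than twice or with a non-positive homozygosity estimate, and raising an error when a degeneracy class is empty) is a simplification chosen for this sketch.

```python
from collections import Counter
from itertools import product

# Standard genetic code in UCAG order (assumed here; '*' marks stop codons).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    if aa not in "MW*":
        SYNONYMS.setdefault(aa, []).append(codon)

# Number of amino acids in each degeneracy class (2-, 3-, 4- and 6-fold).
CLASS_SIZES = {2: 9, 3: 1, 4: 5, 6: 3}


def enc(seq):
    """Effective number of codons: 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6, capped at 61."""
    seq = seq.upper().replace("T", "U")
    usable = len(seq) - len(seq) % 3
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))
    f_by_class = {k: [] for k in CLASS_SIZES}
    for aa, codons in SYNONYMS.items():
        n = sum(counts[c] for c in codons)
        if n < 2:
            continue                      # too few observations (simplification)
        homozygosity = sum((counts[c] / n) ** 2 for c in codons)
        f = (n * homozygosity - 1) / (n - 1)
        if f > 0:
            f_by_class[len(codons)].append(f)
    if any(not fs for fs in f_by_class.values()):
        raise ValueError("sequence too short or too sparse to estimate ENC")
    nc = 2.0 + sum(CLASS_SIZES[k] / (sum(fs) / len(fs)) for k, fs in f_by_class.items())
    return min(nc, 61.0)


def linear_fit(x, y):
    """Return (slope, R^2) of the least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx, (sxy * sxy) / (sxx * syy)
```

Regressing a CA dimension on one codon usage parameter with `linear_fit` gives quantities of the same kind as the "Parameter estimate" and "R-Square" columns of Table 2, although the significance test reported there is not reproduced in this sketch.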


Supplemental data 


Supplemental Data include a Table and can be found with
this article online at http://e-emm.or.kr/article/article_files/
SP-41-10-07.pdf.